D² Absolute Error Score (d2_absolute_error_score)#

d2_absolute_error_score is a baseline-relative regression score based on absolute error (L1). It is the L1 analogue of r2_score:

  • R² compares your squared error to the mean baseline (best constant under L2).

  • D²_absolute compares your absolute error to the median baseline (best constant under L1).

Best possible score is 1.0.

  • 1.0 → perfect predictions.

  • 0.0 → no better than always predicting the (weighted) median of y_true.

  • < 0 → worse than the median baseline (can be arbitrarily negative).


Learning goals#

  • Write the metric in math notation and interpret its values.

  • See why the median is the baseline for absolute error.

  • Implement d2_absolute_error_score from scratch in NumPy (including sample weights).

  • Use the metric while optimizing a simple L1 linear regression model.

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import d2_absolute_error_score as sk_d2_absolute_error_score
from sklearn.metrics import mean_absolute_error as sk_mean_absolute_error

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

1) Definition (L1 analogue of R²)#

For targets (y_1,\dots,y_n) and predictions (\hat{y}_1,\dots,\hat{y}_n), define the mean absolute error (MAE):

[ \mathrm{MAE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| ]
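As a quick sanity check of this definition, the manual formula matches scikit-learn's mean_absolute_error (numbers reused from the tiny example below):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

manual = np.mean(np.abs(y - y_hat))
print(manual, mean_absolute_error(y, y_hat))  # both 0.5
```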

Let (m) be the empirical median of (y) (or the weighted median if sample_weight is provided). The median baseline is the constant predictor:

[ \tilde{y}_i = m\quad\text{for all } i ]

Then the D² absolute error score is

[ D^2_{\mathrm{AE}}(y, \hat{y}) = 1 - \frac{\mathrm{MAE}(y, \hat{y})}{\mathrm{MAE}(y, \tilde{y})}. ]

This is exactly what scikit-learn implements: d2_absolute_error_score(y_true, y_pred) is the special case of d2_pinball_score(..., alpha=0.5).

Multi-output#

If (y\in\mathbb{R}^{n\times m}), scikit-learn computes a score per output and then averages them (uniformly by default).

Important edge case (constant targets)#

If all (y_i) are equal, then (\mathrm{MAE}(y, \tilde{y}) = 0).

  • perfect predictions → score 1.0

  • otherwise → scikit-learn returns 0.0 (because the baseline has zero error).

# A tiny example

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

m = np.median(y_true)
mae_model = np.mean(np.abs(y_true - y_pred))
mae_baseline = np.mean(np.abs(y_true - m))

print("median m      =", m)
print("MAE(model)    =", mae_model)
print("MAE(baseline) =", mae_baseline)
print("D²_AE manual  =", 1 - mae_model / mae_baseline)
print("D²_AE sklearn =", sk_d2_absolute_error_score(y_true, y_pred))
median m      = 2.5
MAE(model)    = 0.5
MAE(baseline) = 2.125
D²_AE manual  = 0.7647058823529411
D²_AE sklearn = 0.7647058823529411

2) Why the baseline is the median#

For an L2 loss, the best constant prediction is the mean.

For an L1 loss, the best constant prediction is the median.

Consider the optimization problem over constants (c):

[ \min_c \; f(c) = \sum_{i=1}^n |y_i - c|. ]

  • (f(c)) is convex and piecewise-linear.

  • A (sub)gradient of (|y_i - c|) w.r.t. (c) is (\mathrm{sign}(c - y_i)) (with any value in ([-1,1]) allowed when (c=y_i)).

So a subgradient of (f) is:

[ g(c) = \sum_{i=1}^n \mathrm{sign}(c - y_i) \in \partial f(c). ]

At an optimum (c^*), we need (0 \in \partial f(c^*)), which happens when roughly half the mass is on each side:

  • (\#\{i : y_i \le c^*\} \ge n/2)

  • (\#\{i : y_i \ge c^*\} \ge n/2)

That’s exactly the defining property of a median.

# Visual: MAE of a constant predictor c is minimized at the median

y = rng.normal(loc=0.0, scale=1.0, size=101)

c_grid = np.linspace(y.min() - 1.0, y.max() + 1.0, 400)
mae_c = np.array([np.mean(np.abs(y - c)) for c in c_grid])

m = np.median(y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=c_grid, y=mae_c, mode="lines", name="MAE(y, c)"))
fig.add_vline(x=float(m), line_dash="dash", line_color="#E45756", annotation_text="median")

fig.update_layout(
    title="For L1, the best constant predictor is the median",
    xaxis_title="constant prediction c",
    yaxis_title="MAE(y, c)",
    height=420,
)
fig.show()

print("median:", m)
print("argmin grid approx:", c_grid[np.argmin(mae_c)])
median: -0.016801157504288795
argmin grid approx: -0.01921790647673083

3) Interpreting values: 1, 0, negative#

Because D² is a ratio to the median baseline:

  • 1.0 means (\mathrm{MAE}(y,\hat{y}) = 0).

  • 0.0 means (\mathrm{MAE}(y,\hat{y}) = \mathrm{MAE}(y,\tilde{y})) (you match the baseline).

  • -1.0 means your MAE is twice the baseline MAE.

  • More generally, if (\mathrm{MAE}(y,\hat{y}) = k \cdot \mathrm{MAE}(y,\tilde{y})), then (D^2_{\mathrm{AE}} = 1 - k).

Below we build a few predictors on the same y_true and compare them.

y_true = 1.5 + rng.normal(size=80)

m = np.median(y_true)
baseline = np.full_like(y_true, m)

models = {
    "Perfect": y_true,
    "Noisy": y_true + rng.normal(0, 0.4, size=y_true.shape),
    "Median baseline": baseline,
    "Anti-median": 2 * m - y_true,  # doubles every |y-m|, so D² = 1 - 2 = -1
}

rows = []
for name, y_pred in models.items():
    mae_model = np.mean(np.abs(y_true - y_pred))
    mae_base = np.mean(np.abs(y_true - baseline))
    d2 = sk_d2_absolute_error_score(y_true, y_pred)
    rows.append((name, mae_model, mae_base, d2))

rows
[('Perfect', 0.0, 0.7787010392304238, 1.0),
 ('Noisy', 0.33910291099295586, 0.7787010392304238, 0.5645274708659884),
 ('Median baseline', 0.7787010392304238, 0.7787010392304238, 0.0),
 ('Anti-median', 1.5574020784608475, 0.7787010392304238, -1.0)]
# Visual: MAE(model) vs MAE(baseline) and the resulting D²
names = [r[0] for r in rows]
mae_model = np.array([r[1] for r in rows])
mae_base = np.array([r[2] for r in rows])
d2 = np.array([r[3] for r in rows])

fig = make_subplots(rows=1, cols=2, subplot_titles=("Absolute errors", "D²_AE score"))

fig.add_trace(go.Bar(x=names, y=mae_model, name="MAE(model)", marker_color="#4C78A8"), row=1, col=1)
fig.add_trace(go.Bar(x=names, y=mae_base, name="MAE(median baseline)", marker_color="#F58518"), row=1, col=1)
fig.update_yaxes(title_text="MAE", row=1, col=1)

fig.add_trace(go.Bar(x=names, y=d2, name="D²_AE", marker_color="#54A24B"), row=1, col=2)
fig.add_hline(y=0, line_dash="dash", line_color="gray", row=1, col=2)
fig.update_yaxes(title_text="score", row=1, col=2)

fig.update_layout(barmode="group", height=420)
fig.show()

4) From-scratch NumPy implementation#

The implementation below mirrors scikit-learn’s behavior:

  • supports 1D and multi-output targets

  • supports sample_weight

  • supports multioutput aggregation:

    • 'raw_values' (per-output)

    • 'uniform_average' (default)

    • array-like weights (length = n_outputs)

We also reproduce scikit-learn’s convention for the constant-target edge case.

def _to_2d(y):
    y = np.asarray(y)
    if y.ndim == 1:
        return y.reshape(-1, 1)
    if y.ndim == 2:
        return y
    raise ValueError(f"y must be 1D or 2D, got shape {y.shape}")


def weighted_percentile_lower(array, sample_weight, percentile=50.0):
    '''Lower weighted percentile (matches sklearn.utils.stats._weighted_percentile).

    Parameters
    ----------
    array : (n,) or (n, m)
    sample_weight : (n,) or same shape as array
    percentile : float in [0, 100]

    Returns
    -------
    q : float or (m,) ndarray
    '''
    array = np.asarray(array)
    w = np.asarray(sample_weight)

    n_dim = array.ndim
    if n_dim == 0:
        return array.item()

    if array.ndim == 1:
        array = array.reshape(-1, 1)

    if w.ndim == 1:
        if w.shape[0] != array.shape[0]:
            raise ValueError("sample_weight must have shape (n_samples,)")
        w = np.tile(w.reshape(-1, 1), (1, array.shape[1]))
    elif w.shape != array.shape:
        raise ValueError("sample_weight must have shape (n_samples,) or array.shape")

    if np.any(w < 0):
        raise ValueError("sample_weight must be non-negative")

    sorted_idx = np.argsort(array, axis=0)
    sorted_array = np.take_along_axis(array, sorted_idx, axis=0)
    sorted_w = np.take_along_axis(w, sorted_idx, axis=0).astype(float)

    cdf = np.cumsum(sorted_w, axis=0)
    total = cdf[-1]

    adj = (percentile / 100.0) * total
    adj = np.asarray(adj, dtype=float)

    # percentile=0: ignore leading zeros (sklearn GH20528 behavior)
    mask = adj == 0
    if np.any(mask):
        adj[mask] = np.nextafter(adj[mask], adj[mask] + 1)

    idx = np.array([np.searchsorted(cdf[:, j], adj[j]) for j in range(array.shape[1])])
    idx = np.clip(idx, 0, array.shape[0] - 1)

    q = sorted_array[idx, np.arange(array.shape[1])]
    return q[0] if n_dim == 1 else q


def mae_raw_values(y_true, y_pred, sample_weight=None):
    '''Per-output MAE (returns shape (n_outputs,)).'''
    y_true = _to_2d(y_true).astype(float)
    y_pred = _to_2d(y_pred).astype(float)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"Shape mismatch: y_true {y_true.shape}, y_pred {y_pred.shape}")

    abs_err = np.abs(y_true - y_pred)

    if sample_weight is None:
        return abs_err.mean(axis=0)

    w = np.asarray(sample_weight, dtype=float).reshape(-1)
    if w.shape[0] != y_true.shape[0]:
        raise ValueError("sample_weight must have shape (n_samples,)")
    if np.any(w < 0):
        raise ValueError("sample_weight must be non-negative")

    w_sum = w.sum()
    if w_sum == 0:
        raise ValueError("sample_weight sum must be > 0")

    return (abs_err * w[:, None]).sum(axis=0) / w_sum


def d2_absolute_error_score_numpy(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    multioutput="uniform_average",
):
    '''NumPy implementation of sklearn.metrics.d2_absolute_error_score.'''
    y_true_2d = _to_2d(y_true)
    y_pred_2d = _to_2d(y_pred)

    if y_true_2d.shape != y_pred_2d.shape:
        raise ValueError(f"Shape mismatch: y_true {y_true_2d.shape}, y_pred {y_pred_2d.shape}")

    n = y_true_2d.shape[0]
    m = y_true_2d.shape[1]

    if n < 2:
        return float("nan")

    num = mae_raw_values(y_true_2d, y_pred_2d, sample_weight=sample_weight)

    if sample_weight is None:
        med = np.percentile(y_true_2d, q=50, axis=0)
    else:
        med = weighted_percentile_lower(y_true_2d, sample_weight=sample_weight, percentile=50)

    den = mae_raw_values(y_true_2d, np.tile(med, (n, 1)), sample_weight=sample_weight)

    scores = np.ones(m, dtype=float)

    nonzero_num = num != 0
    nonzero_den = den != 0
    valid = nonzero_num & nonzero_den

    scores[valid] = 1.0 - (num[valid] / den[valid])
    scores[nonzero_num & ~nonzero_den] = 0.0

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return scores
        if multioutput != "uniform_average":
            raise ValueError("multioutput must be 'raw_values', 'uniform_average', or array-like")
        weights = None
    else:
        weights = np.asarray(multioutput, dtype=float)
        if weights.shape != (m,):
            raise ValueError(f"multioutput weights must have shape ({m},)")

    return float(np.average(scores, weights=weights))
# Quick checks vs scikit-learn

# 1D
n = 120
y_true = rng.normal(size=n)
y_pred = y_true + rng.normal(0, 0.5, size=n)

print("1D")
print("numpy :", d2_absolute_error_score_numpy(y_true, y_pred))
print("sklearn:", sk_d2_absolute_error_score(y_true, y_pred))

# Weighted
w = rng.uniform(0.2, 2.0, size=n)
print()
print("Weighted")
print("numpy :", d2_absolute_error_score_numpy(y_true, y_pred, sample_weight=w))
print("sklearn:", sk_d2_absolute_error_score(y_true, y_pred, sample_weight=w))

# Multioutput
Y_true = rng.normal(size=(80, 3))
Y_pred = Y_true + rng.normal(0, 0.8, size=(80, 3))

print()
print("Multioutput (raw_values)")
print("numpy :", d2_absolute_error_score_numpy(Y_true, Y_pred, multioutput="raw_values"))
print("sklearn:", sk_d2_absolute_error_score(Y_true, Y_pred, multioutput="raw_values"))

print()
print("Multioutput (weighted average)")
weights = np.array([0.2, 0.3, 0.5])
print("numpy :", d2_absolute_error_score_numpy(Y_true, Y_pred, multioutput=weights))
print("sklearn:", sk_d2_absolute_error_score(Y_true, Y_pred, multioutput=weights))
1D
numpy : 0.4731975975371019
sklearn: 0.4731975975371019

Weighted
numpy : 0.4452618667501379
sklearn: 0.4452618667501379

Multioutput (raw_values)
numpy : [0.3068 0.2318 0.2531]
sklearn: [0.3068 0.2318 0.2531]

Multioutput (weighted average)
numpy : 0.2574517183454593
sklearn: 0.2574517183454593

5) Using D² while optimizing a model (from scratch)#

On a fixed dataset (\{(x_i, y_i)\}_{i=1}^n), the baseline term

[ \mathrm{MAE}(y, \tilde{y}) = \mathrm{MAE}\bigl(y,\ \mathrm{median}(y)\bigr) ]

depends only on (y), not on the model parameters (\theta). Therefore:

[ D^2_{\mathrm{AE}}(\theta) = 1 - \frac{\mathrm{MAE}(\theta)}{\mathrm{MAE}_\text{baseline}} ]

and maximizing (D^2_{\mathrm{AE}}) is equivalent to minimizing MAE.

L1 linear regression#

With a linear model (\hat{y} = Xw), the MAE objective is:

[ L(w) = \frac{1}{n}\sum_{i=1}^n |(Xw)_i - y_i|. ]

A subgradient is

[ g(w) = \frac{1}{n}X^\top s \in \partial L(w),\quad s_i = \mathrm{sign}((Xw)_i - y_i), ]

where (\mathrm{sign}(0)) may be replaced by any value in ([-1,1]). In code we use np.sign, which returns 0 at exactly zero residual.

Below we fit an L1 regression model with subgradient descent and track both MAE and (D^2_{\mathrm{AE}}).

# Data with outliers: OLS (L2) can be pulled around; L1 is more robust.

n = 250
x = rng.uniform(-3, 3, size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=n)

# Add a handful of strong outliers
out_idx = rng.choice(n, size=18, replace=False)
y[out_idx] += rng.normal(12, 4, size=out_idx.shape[0])

X = np.column_stack([np.ones(n), x])

# Baseline
m = np.median(y)
y_baseline = np.full_like(y, m)
mae_baseline = np.mean(np.abs(y - y_baseline))

# OLS (minimizes MSE, not MAE)
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat_ols = X @ w_ols

print("MAE baseline:", mae_baseline)
print("D²_AE (OLS):", sk_d2_absolute_error_score(y, y_hat_ols))
MAE baseline: 3.1399640289860704
D²_AE (OLS): 0.3731633735620893
def fit_l1_linear_subgradient(X, y, *, lr0=0.4, n_steps=600):
    '''Minimize MAE(y, Xw) with subgradient descent.'''
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)

    n, d = X.shape
    w = np.zeros(d)

    mae_hist = []
    d2_hist = []

    # Constant baseline for D²
    m = np.median(y)
    mae_base = np.mean(np.abs(y - m))

    for t in range(n_steps):
        y_hat = X @ w
        r = y_hat - y

        # Subgradient of MAE
        g = (X.T @ np.sign(r)) / n

        lr = lr0 / np.sqrt(t + 1)  # simple diminishing step size
        w = w - lr * g

        mae = np.mean(np.abs(r))
        d2 = 1.0 if mae == 0 else 1.0 - mae / mae_base

        mae_hist.append(mae)
        d2_hist.append(d2)

    return w, np.array(mae_hist), np.array(d2_hist)


w_l1, mae_hist, d2_hist = fit_l1_linear_subgradient(X, y)

y_hat_l1 = X @ w_l1

print("w_ols:", w_ols)
print("w_l1 :", w_l1)

print("D²_AE (L1):", sk_d2_absolute_error_score(y, y_hat_l1))
w_ols: [2.8653 1.5073]
w_l1 : [2.0334 1.5742]
D²_AE (L1): 0.44939361486933627
# Optimization diagnostics
iters = np.arange(len(mae_hist))

fig = make_subplots(rows=1, cols=2, subplot_titles=("D²_AE vs iteration", "MAE vs iteration"))

fig.add_trace(go.Scatter(x=iters, y=d2_hist, mode="lines", name="D²_AE"), row=1, col=1)
fig.add_hline(y=0, line_dash="dash", line_color="gray", row=1, col=1)
fig.update_xaxes(title_text="iteration", row=1, col=1)
fig.update_yaxes(title_text="score", row=1, col=1)

fig.add_trace(go.Scatter(x=iters, y=mae_hist, mode="lines", name="MAE"), row=1, col=2)
fig.update_xaxes(title_text="iteration", row=1, col=2)
fig.update_yaxes(title_text="MAE", row=1, col=2)

fig.update_layout(height=420)
fig.show()
# Fit visualization
x_line = np.linspace(x.min(), x.max(), 200)
X_line = np.column_stack([np.ones_like(x_line), x_line])

y_line_ols = X_line @ w_ols
y_line_l1 = X_line @ w_l1

y_line_baseline = np.full_like(x_line, m)

colors = np.where(np.isin(np.arange(n), out_idx), "#E45756", "#4C78A8")

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="markers",
        name="data",
        marker=dict(size=6, opacity=0.65, color=colors),
    )
)
fig.add_trace(go.Scatter(x=x_line, y=y_line_baseline, mode="lines", name="median baseline", line=dict(dash="dot")))
fig.add_trace(go.Scatter(x=x_line, y=y_line_ols, mode="lines", name="OLS (L2)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=x_line, y=y_line_l1, mode="lines", name="L1 via MAE", line=dict(width=3)))

fig.update_layout(
    title="L1 (MAE) fit is less sensitive to outliers (red points)",
    xaxis_title="x",
    yaxis_title="y",
    height=520,
)
fig.show()

6) Pros, cons, and when to use#

Pros#

  • Baseline-relative and interpretable: 0 = “no better than predicting the median”.

  • Robust to outliers (compared to L2-based scores): absolute error grows linearly.

  • Scale- and shift-invariant: applying the same positive scaling and shift to both y_true and y_pred leaves the score unchanged.

  • Connects to quantile regression: it is d2_pinball_score(alpha=0.5).

Cons / pitfalls#

  • Non-smooth objective: optimizing MAE (and thus D²) needs subgradients / specialized solvers (or smooth approximations like Huber).

  • Can be negative and unbounded below: easy to misread if you expect a ([0,1]) score.

  • Not meaningful for a single sample: returns NaN for n_samples < 2.

  • Constant targets edge case: if y_true is constant, the baseline error is 0; scikit-learn returns 1 for perfect predictions, otherwise 0.

Good use cases#

  • You care about typical error magnitude (median-like behavior), not rare extreme misses.

  • Data has heavy tails / outliers and you want a robustness-oriented score.

  • Model selection where a baseline-relative, unitless score is easier to compare than raw MAE.

Exercises#

  1. Create a dataset where OLS achieves a higher R² but a lower D²_AE than an L1 fit. Explain why.

  2. Show numerically that d2_absolute_error_score is identical to d2_pinball_score(alpha=0.5).

  3. Implement a Huber regression from scratch and compare its D²_AE to OLS and L1.

  4. With sample weights, build an example where the weighted median baseline shifts drastically and changes the score.

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.d2_absolute_error_score.html

  • scikit-learn user guide (D²): https://scikit-learn.org/stable/modules/model_evaluation.html#d2-score

  • Koenker & Machado (1999), quantile regression goodness-of-fit: https://doi.org/10.1080/01621459.1999.10473882